Abstract

In this paper we revisit the classical regular expression matching problem, namely, given a regular
expression $R$ and a string $Q$, decide if $Q$ matches one of the strings specified by $R$. Let $m$ and $n$ be the
lengths of $R$ and $Q$, respectively. On a standard unit-cost RAM with word length $w \geq \log n$, we show
that the problem can be solved in $O(m)$ space with the following running times:

$$
\begin{cases}
O\left(n \frac{m \log w}{w} + m \log w\right) & \text{if } m > w \\
O(n \log m + m \log m) & \text{if } \sqrt{w} < m \leq w \\
O(\min(n + m^2,\ n \log m + m \log m)) & \text{if } m \leq \sqrt{w}.
\end{cases}
$$

This improves the best known time bound among algorithms using $O(m)$ space. Whenever $w > \log^2 n$
it improves all known time bounds regardless of how much space is used.

1 Introduction 

Regular expressions are a powerful and simple way to describe a set of strings. For this reason, they are often
chosen as the input language for text processing applications. For instance, in the lexical analysis phase of
compilers, regular expressions are often used to specify and distinguish tokens to be passed to the syntax
analysis phase. Utilities such as Grep, the programming language Perl, and most modern text editors provide
mechanisms for handling regular expressions. These applications all need to solve the classical Regular
Expression Matching problem, namely, given a regular expression $R$ and a string $Q$, decide if $Q$ matches
one of the strings specified by $R$.

The standard textbook solution, proposed by Thompson [11] in 1968, constructs a non-deterministic
finite automaton (NFA) accepting all strings matching $R$. Subsequently, a state-set simulation checks if the
NFA accepts $Q$. This leads to a simple $O(nm)$ time and $O(m)$ space algorithm, where $m$ and $n$ are the
number of symbols in $R$ and $Q$, respectively. The full details are reviewed later in Sec. 2 and can be found in
most textbooks on compilers (e.g. Aho et al. [1]). Despite the importance of the problem, it took 24 years
before the $O(nm)$ time bound was improved by Myers [8] in 1992, who achieved $O(\frac{mn}{\log n} + (n + m) \log n)$
time and $O(\frac{mn}{\log n})$ space. For most values of $m$ and $n$ this improves the $O(nm)$ algorithm by an $O(\log n)$
factor. Currently, this is the fastest known algorithm. Recently, Bille and Farach-Colton [4] showed how
to reduce the space of Myers' solution to $O(n)$. Alternatively, they showed how to achieve a speedup of
$O(\log m)$ over Thompson's algorithm while using $O(m)$ space. These results are all valid on a unit-cost
RAM with $w$-bit words and a standard instruction set including addition, bitwise boolean operations, shifts,
and multiplication. Each word is capable of holding a character of $Q$ and hence $w \geq \log n$. The space
complexities refer to the number of words used by the algorithm, not counting the input which is assumed
to be read-only. All results presented here assume the same model. In this paper we present new algorithms
achieving the following complexities:

*The IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen S, Denmark. Email: beetle@itu.dk. An
extended abstract of this paper appeared in Proceedings of the 33rd International Colloquium on Automata, Languages and 
Programming, 2006. 
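Theorem 1 Given a regular expression $R$ and a string $Q$ of lengths $m$ and $n$, respectively, Regular
Expression Matching can be solved using $O(m)$ space with the following running times:

$$
\begin{cases}
O\left(n \frac{m \log w}{w} + m \log w\right) & \text{if } m > w \\
O(n \log m + m \log m) & \text{if } \sqrt{w} < m \leq w \\
O(\min(n + m^2,\ n \log m + m \log m)) & \text{if } m \leq \sqrt{w}.
\end{cases}
$$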



This represents the best known time bound among algorithms using $O(m)$ space. To compare these with
previous results, consider a conservative word length of $w = \log n$. When the regular expression is "large",
e.g., $m > \log n$, we achieve an $O(\frac{\log n}{\log \log n})$ factor speedup over Thompson's algorithm using $O(m)$ space.
Hence, we simultaneously match the best known time and space bounds for the problem, with the exception
of an $O(\log \log n)$ factor in time. More interestingly, consider the case when the regular expression is
"small", e.g., $m = O(\log n)$. This is the case in most applications. To beat the $O(n \log n)$ time
of Thompson's algorithm, the fast algorithms [4, 8] essentially convert the NFA mentioned above into a
deterministic finite automaton (DFA) and then simulate this instead. Constructing and storing the DFA
incurs an additional exponential time and space cost in $m$, i.e., $O(2^m) = O(n)$. However, the DFA can now
be simulated in $O(n)$ time, leading to an $O(n)$ time and space algorithm. Surprisingly, our result shows that
this exponential blow-up in $m$ can be avoided with very little loss of efficiency. More precisely, we get an
algorithm using $O(n \log \log n)$ time and $O(\log n)$ space. Hence, the space is improved exponentially at the
cost of an $O(\log \log n)$ factor in time. In the case of an even smaller regular expression, e.g., $m = O(\sqrt{\log n})$,
the slowdown can be eliminated and we achieve optimal $O(n)$ time. For larger word lengths our time bounds
improve. In particular, when $w > \log n \log \log n$ the bound is better in all cases, except for $\sqrt{w} < m \leq w$,
and when $w > \log^2 n$ it improves all known time bounds regardless of how much space is used.

The key to obtaining our results is to avoid explicitly converting small NFAs into DFAs. Instead we show how
to effectively simulate them directly using the parallelism available at the word-level of the machine model.
This kind of idea is not new and has been applied to many other string matching problems, most famously,
the Shift-Or algorithm [3], and the approximate string matching algorithm by Myers [9]. However, none of
these algorithms can be easily extended to Regular Expression Matching. The main problem is the
complicated dependencies between states in an NFA. Intuitively, a state may have long paths of $\epsilon$-transitions
to a large number of other states, all of which have to be traversed in parallel in the state-set simulation. To
overcome this problem we develop several new techniques ultimately leading to Theorem 1. For instance,
we introduce a new hierarchical decomposition of NFAs suitable for a parallel state-set simulation. We also
show how state-set simulations of large NFAs efficiently reduce to simulating small NFAs.

The results presented in this paper are primarily of theoretical interest. However, we believe that most of 
the ideas are useful in practice. The previous algorithms require large tables for storing DFAs, and perform 
a long series of lookups in these tables. As the tables become large we can expect a high number of cache-misses
during the lookups, thus limiting the speedup in practice. Since we avoid these tables, our algorithms
do not suffer from this defect. 

The paper is organized as follows. In Sec. 2 we review Thompson's NFA construction, and in Sec. 3
we present the above mentioned reduction. In Sec. 4 we present our first simple algorithm for the problem
which is then improved in Sec. 5. Combining these algorithms with our reduction leads to Theorem 1. We
conclude with a couple of remarks and open problems in the final section.

2 Regular Expressions and Finite Automata 

In this section we briefly review Thompson's construction and the standard state-set simulation. The set of
regular expressions over an alphabet $\Sigma$ is defined recursively as follows:

• A character $\alpha \in \Sigma$ is a regular expression.

• If $S$ and $T$ are regular expressions then so is the concatenation, $(S) \cdot (T)$, the union, $(S) | (T)$, and the
star, $(S)^*$.
Figure 1: Thompson's NFA construction. The regular expression for a character $\alpha \in \Sigma$ corresponds to NFA
(a). If $S$ and $T$ are regular expressions then $N(ST)$, $N(S|T)$, and $N(S^*)$ correspond to NFAs (b), (c), and
(d), respectively. Accepting nodes are marked with a double circle.

Unnecessary parentheses can be removed by observing that $\cdot$ and $|$ are associative and by using the standard
precedence of the operators, that is $*$ precedes $\cdot$, which in turn precedes $|$. We often remove the $\cdot$ when
writing regular expressions.

The language $L(R)$ generated by $R$ is the set of all strings matching $R$. The parse tree $T(R)$ of $R$ is the
binary rooted tree representing the hierarchical structure of $R$. Each leaf is labeled by a character in $\Sigma$ and
each internal node is labeled either $\cdot$, $|$, or $*$. A finite automaton is a tuple $A = (V, E, \delta, \theta, \phi)$, where

• $V$ is a set of nodes called states,

• $E$ is a set of directed edges between states called transitions,

• $\delta : E \to \Sigma \cup \{\epsilon\}$ is a function assigning labels to transitions, and

• $\theta, \phi \in V$ are distinguished states called the start state and accepting state, respectively¹.

Intuitively, $A$ is an edge-labeled directed graph with special start and accepting nodes. $A$ is a deterministic
finite automaton (DFA) if $A$ does not contain any $\epsilon$-transitions, and all outgoing transitions of any state have
different labels. Otherwise, $A$ is a non-deterministic automaton (NFA). We say that $A$ accepts a string $Q$ if
there is a path from $\theta$ to $\phi$ such that the concatenation of labels on the path spells out $Q$. Thompson [11]
showed how to recursively construct an NFA $N(R)$ accepting all strings in $L(R)$. The rules are presented
below and illustrated in Fig. 1.

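• $N(\alpha)$ is the automaton consisting of states $\theta_\alpha$, $\phi_\alpha$, and an $\alpha$-transition from $\theta_\alpha$ to $\phi_\alpha$.

• Let $N(S)$ and $N(T)$ be automata for regular expressions $S$ and $T$ with start and accepting states $\theta_S$,
$\theta_T$, $\phi_S$, and $\phi_T$, respectively. Then, NFAs $N(S \cdot T)$, $N(S|T)$, and $N(S^*)$ are constructed as follows:

$N(ST)$: Add start state $\theta_{ST}$ and accepting state $\phi_{ST}$, and $\epsilon$-transitions $(\theta_{ST}, \theta_S)$, $(\phi_S, \theta_T)$, and
$(\phi_T, \phi_{ST})$.

$N(S|T)$: Add start state $\theta_{S|T}$ and accepting state $\phi_{S|T}$, and add $\epsilon$-transitions $(\theta_{S|T}, \theta_S)$, $(\theta_{S|T}, \theta_T)$,
$(\phi_S, \phi_{S|T})$, and $(\phi_T, \phi_{S|T})$.

$N(S^*)$: Add a new start state $\theta_{S^*}$ and accepting state $\phi_{S^*}$, and $\epsilon$-transitions $(\theta_{S^*}, \theta_S)$, $(\theta_{S^*}, \phi_{S^*})$,
$(\phi_S, \phi_{S^*})$, and $(\phi_S, \theta_S)$.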
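The construction rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple-based parse tree (`"char"`, `"cat"`, `"alt"`, `"star"`), the function name `thompson`, and the use of `None` as the label of an $\epsilon$-transition are choices made here, not notation from the paper.

```python
from itertools import count

EPS = None  # label used for epsilon-transitions (an assumption of this sketch)


def thompson(node, edges, ids=None):
    """Build N(node) following the rules in the text.

    Appends (source, target, label) triples to `edges` and returns the
    pair (start state, accepting state) of the automaton for `node`.
    """
    ids = count() if ids is None else ids
    start, accept = next(ids), next(ids)   # every parse-tree node adds two states
    kind = node[0]
    if kind == "char":                     # N(alpha)
        edges.append((start, accept, node[1]))
    elif kind == "cat":                    # N(S . T), with its own start/accept pair
        s_start, s_accept = thompson(node[1], edges, ids)
        t_start, t_accept = thompson(node[2], edges, ids)
        edges += [(start, s_start, EPS), (s_accept, t_start, EPS),
                  (t_accept, accept, EPS)]
    elif kind == "alt":                    # N(S | T)
        s_start, s_accept = thompson(node[1], edges, ids)
        t_start, t_accept = thompson(node[2], edges, ids)
        edges += [(start, s_start, EPS), (start, t_start, EPS),
                  (s_accept, accept, EPS), (t_accept, accept, EPS)]
    elif kind == "star":                   # N(S*); the last edge is the back transition
        s_start, s_accept = thompson(node[1], edges, ids)
        edges += [(start, s_start, EPS), (start, accept, EPS),
                  (s_accept, accept, EPS), (s_accept, s_start, EPS)]
    else:
        raise ValueError("unknown node kind: %r" % (kind,))
    return start, accept


# Parse tree of ac|a*b, written by hand.
tree = ("alt",
        ("cat", ("char", "a"), ("char", "c")),
        ("cat", ("star", ("char", "a")), ("char", "b")))
edges = []
start, accept = thompson(tree, edges)
```

Because every parse-tree node contributes exactly two states, the example tree with 8 nodes yields 16 states, in line with the $2m$ count used in the text.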

Readers familiar with Thompson's construction will notice that $N(ST)$ is slightly different from the usual
construction. This is done to simplify our later presentation and does not affect the worst case complexity of
the problem. Any automaton produced by these rules we call a Thompson-NFA (TNFA). By construction,
$N(R)$ has a single start and accepting state, denoted $\theta$ and $\phi$, respectively. $\theta$ has no incoming transitions and
$\phi$ has no outgoing transitions. The total number of states is $2m$ and since each state has at most 2 outgoing
transitions the total number of transitions is at most $4m$. Furthermore, all incoming transitions of a state have
the same label, and we denote a state with incoming $\alpha$-transitions an $\alpha$-state. Note that the star construction
in Fig. 1(d) introduces a transition from the accepting state of $N(S)$ to the start state of $N(S)$. All such
transitions are called back transitions and all other transitions are forward transitions. We need the following
property.

¹Sometimes NFAs are allowed a set of accepting states, but this is not necessary for our purposes.

Lemma 1 (Myers [8]) Any cycle-free path in a TNFA contains at most one back transition. 

For a string $Q$ of length $n$ the standard state-set simulation of $N(R)$ on $Q$ produces a sequence of state-sets
$S_0, \ldots, S_n$. The $i$th set $S_i$, $0 \leq i \leq n$, consists of all states in $N(R)$ for which there is a path from $\theta$ that
spells out the $i$th prefix of $Q$. The simulation can be implemented with the following simple operations. For
a state-set $S$ and a character $\alpha \in \Sigma$, define

Move$(S, \alpha)$: Return the set of states reachable from $S$ via a single $\alpha$-transition.
Close$(S)$: Return the set of states reachable from $S$ via 0 or more $\epsilon$-transitions.

Since the number of states and transitions in $N(R)$ is $O(m)$, both operations can be easily implemented in
$O(m)$ time. The Close operation is often called an $\epsilon$-closure. The simulation proceeds as follows: Initially,
$S_0 := \mathrm{Close}(\{\theta\})$. If $Q[j] = \alpha$, $1 \leq j \leq n$, then $S_j := \mathrm{Close}(\mathrm{Move}(S_{j-1}, \alpha))$. Finally, $Q \in L(R)$ iff $\phi \in S_n$.
Since each state-set $S_j$ only depends on $S_{j-1}$ this algorithm uses $O(mn)$ time and $O(m)$ space.
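The simulation just described can be sketched as follows. The edge-list representation, the helper names (`index_edges`, `move`, `close`, `matches`) and the hand-numbered TNFA for $a^*b$ are assumptions made for this illustration; only the Move/Close logic is taken from the text.

```python
from collections import defaultdict

EPS = None  # label of an epsilon-transition in this sketch


def index_edges(edges):
    """Split (source, target, label) triples into adjacency lists."""
    eps_out, char_out = defaultdict(list), defaultdict(list)
    for s, t, label in edges:
        if label is EPS:
            eps_out[s].append(t)
        else:
            char_out[(s, label)].append(t)
    return eps_out, char_out


def move(char_out, states, ch):
    """Move(S, ch): states reachable from S via a single ch-transition."""
    return {t for s in states for t in char_out.get((s, ch), ())}


def close(eps_out, states):
    """Close(S): states reachable from S via 0 or more epsilon-transitions."""
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in eps_out.get(s, ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return result


def matches(edges, start, accept, text):
    """Standard state-set simulation: O(m) work per character of `text`."""
    eps_out, char_out = index_edges(edges)
    current = close(eps_out, {start})
    for ch in text:
        current = close(eps_out, move(char_out, current, ch))
    return accept in current


# TNFA for a*b built by hand with the rules of this section; state 1 is the
# start state and state 8 the accepting state. (4, 3) is the back transition.
A_STAR_B = [(1, 2, EPS), (2, 3, EPS), (2, 5, EPS), (3, 4, "a"), (4, 5, EPS),
            (4, 3, EPS), (5, 6, EPS), (6, 7, "b"), (7, 8, EPS)]
```

For instance, `matches(A_STAR_B, 1, 8, "aab")` walks through the state-sets $S_0, \ldots, S_3$ exactly as in the description above, keeping only the current set in memory.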

3 From Large to Small TNFAs 

In this section we show how to simulate $N(R)$ by simulating a number of smaller TNFAs. We will use this
to achieve our bounds when $R$ is large.

3.1 Clustering Parse Trees and Decomposing TNFAs 

Let $R$ be a regular expression of length $m$. We first show how to decompose $N(R)$ into smaller TNFAs. This
decomposition is based on a simple clustering of the parse tree $T(R)$. A cluster $C$ is a connected subgraph of
$T(R)$ and a cluster partition $CS$ is a partition of the nodes of $T(R)$ into node-disjoint clusters. Since $T(R)$
is a binary tree with $O(m)$ nodes, a simple top-down procedure provides the following result (see e.g. [8]):

Lemma 2 Given a regular expression $R$ of length $m$ and a parameter $x$, a cluster partition $CS$ of $T(R)$ can
be constructed in $O(m)$ time such that $|CS| = O(\lceil m/x \rceil)$, and for any $C \in CS$, the number of nodes in $C$ is
at most $x$.
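One way to obtain a partition with the properties of Lemma 2 is sketched below. It is a greedy bottom-up variant chosen for brevity, not necessarily the top-down procedure referred to in the text; the `children` dictionary and the function name `cluster_partition` are assumptions of this sketch. Each node absorbs the still-open clusters of its children, smallest first, as long as the size stays at most $x$; every child cluster that is closed off this way has at least $x/2$ nodes, which keeps the number of clusters at $O(\lceil m/x \rceil)$.

```python
def cluster_partition(children, root, x):
    """Map every node of a binary tree to a cluster id such that each
    cluster is connected and holds at most x nodes (x >= 1)."""
    # Iterative traversal: parents appear before their children in `order`.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, ()))

    open_size = {}  # size of the still-open cluster whose topmost node is v
    absorbed = {}   # children whose open cluster is merged into v's cluster
    for v in reversed(order):                      # children before parents
        size, absorbed[v] = 1, []
        for c in sorted(children.get(v, ()), key=lambda c: open_size[c]):
            if size + open_size[c] <= x:           # smallest child first
                size += open_size[c]
                absorbed[v].append(c)
        open_size[v] = size

    cluster, next_id = {root: 0}, 1
    for v in order:                                # parents before children
        for c in children.get(v, ()):
            if c in absorbed[v]:
                cluster[c] = cluster[v]
            else:                                  # (v, c) is an external edge
                cluster[c] = next_id
                next_id += 1
    return cluster


# Parse tree of ac|a*b: node 0 is '|', nodes 1 and 2 are concatenations,
# node 5 is the star, and nodes 3, 4, 6, 7 are the leaves a, c, b, a.
parse_tree = {0: [1, 2], 1: [3, 4], 2: [5, 6], 5: [7]}
clusters = cluster_partition(parse_tree, 0, 3)
```

With $x = 3$ the example tree is split into three connected clusters of at most three nodes each.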

For a cluster partition $CS$, edges adjacent to two clusters are external edges and all other edges are internal
edges. Contracting all internal edges in $CS$ induces a macro tree, where each cluster is represented by a
single macro node. Let $C_v$ and $C_w$ be two clusters with corresponding macro nodes $v$ and $w$. We say that
$C_v$ is the parent cluster (resp. child cluster) of $C_w$ if $v$ is the parent (resp. child) of $w$ in the macro tree.
The root cluster and leaf clusters are the clusters corresponding to the root and the leaves of the macro tree.
An example clustering of a parse tree is shown in Fig. 2(b). Given a cluster partition $CS$ of $T(R)$ we show
how to divide $N(R)$ into a set of small nested TNFAs. Each cluster $C \in CS$ will correspond to a TNFA $A$,
and we use the terms child, parent, root, and leaf for the TNFAs in the same way we do with clusters. For
a cluster $C \in CS$ with children $C_1, \ldots, C_l$, insert a special pseudo-node $p_i$, $1 \leq i \leq l$, in the middle of the
external edge connecting $C$ with $C_i$. We label each pseudo-node by a special character $\beta \notin \Sigma$. Let $T_C$ be
the tree induced by the set of nodes in $C$ and $\{p_1, \ldots, p_l\}$. Each leaf in $T_C$ is labeled with a character from
$\Sigma \cup \{\beta\}$, and hence $T_C$ is a well-formed parse tree for some regular expression $R_C$ over $\Sigma \cup \{\beta\}$. Now, the
TNFA $A$ corresponding to $C$ is $N(R_C)$. In $A$, child TNFA $A_i$ is represented by its start and accepting state
$\theta_{A_i}$ and $\phi_{A_i}$ and a pseudo-transition labeled $\beta$ connecting them. An example of these definitions is given in
Fig. 2. We call any set of TNFAs obtained from a cluster partition as above a nested decomposition $AS$ of
$N(R)$.
Figure 2: (a) The parse tree for the regular expression $ac|a^*b$. (b) A clustering of (a) into node-disjoint
connected subtrees $C_1$, $C_2$, and $C_3$, each with at most 3 nodes. (c) The clustering from (b) extended with
pseudo-nodes. (d) The nested decomposition of $N(ac|a^*b)$. (e) The TNFA corresponding to $C_1$.

Lemma 3 Given a regular expression $R$ of length $m$ and a parameter $x$, a nested decomposition $AS$ of $N(R)$
can be constructed in $O(m)$ time such that $|AS| = O(\lceil m/x \rceil)$, and for any $A \in AS$, the number of states in
$A$ is at most $x$.

Proof. Construct the parse tree $T(R)$ for $R$ and build a cluster partition $CS$ according to Lemma 2 with
parameter $y = \frac{x}{4} - \frac{1}{2}$. From $CS$ build a nested decomposition $AS$ as described above. Each $C \in CS$
corresponds to a TNFA $A \in AS$ and hence $|AS| = O(\lceil m/y \rceil) = O(\lceil m/x \rceil)$. Furthermore, if $|V(C)| \leq y$ we
have $|V(T_C)| \leq 2y + 1$. Each node in $T_C$ contributes two states to the corresponding TNFA $A$, and hence
the total number of states in $A$ is at most $4y + 2 = x$. Since the parse tree, the cluster partition, and the
nested decomposition can be constructed in $O(m)$ time the result follows. □



3.2 Simulating Large Automata 

We now show how $N(R)$ can be simulated using the TNFAs in a nested decomposition. For this purpose
we define a simple data structure to dynamically maintain the TNFAs. Let $AS$ be a nested decomposition
of $N(R)$ according to Lemma 3 for some parameter $x$. Let $A \in AS$ be a TNFA, let $S_A$ be a state-set of $A$,
let $s$ be a state in $A$, and let $\alpha \in \Sigma$. A simulation data structure supports the 4 operations: $\mathrm{Move}_A(S_A, \alpha)$,
$\mathrm{Close}_A(S_A)$, $\mathrm{Member}_A(S_A, s)$, and $\mathrm{Insert}_A(S_A, s)$. Here, the operations $\mathrm{Move}_A$ and $\mathrm{Close}_A$ are defined exactly
as in Sec. 2 with the modification that they only work on $A$ and not $N(R)$. The operation $\mathrm{Member}_A(S_A, s)$
returns yes if $s \in S_A$ and no otherwise and $\mathrm{Insert}_A(S_A, s)$ returns the set $S_A \cup \{s\}$.

In the following sections we consider various efficient implementations of simulation data structures. For
now assume that we have a black-box data structure for each $A \in AS$. To simulate $N(R)$ we proceed as
follows. First, fix an ordering of the TNFAs in the nested decomposition $AS$, e.g., by a preorder traversal
of the tree given by the parent/child relationship of the TNFAs. The collection of state-sets for
each TNFA in $AS$ is represented in a state-set array $X$ of length $|AS|$. The state-set array is indexed by
the above numbering, that is, $X[i]$ is the state-set of the $i$th TNFA in $AS$. For notational convenience we
write $X[A]$ to denote the entry in $X$ corresponding to $A$. Note that a parent TNFA shares two states with
each child, and therefore a state may be represented more than once in $X$. To avoid complications we will
always assure that $X$ is consistent, meaning that if a state $s$ is included in the state-set of some TNFA, then
it is also included in the state-sets of all other TNFAs that share $s$. If $S = \bigcup_{A \in AS} X[A]$ we say that $X$
models the state-set $S$ and write $S \equiv X$.

Next we show how to do a state-set simulation of $N(R)$ using the operations $\mathrm{Move}_{AS}$ and $\mathrm{Close}_{AS}$, which
we define below. These operations recursively update a state-set array using the simulation data structures.
For any $A \in AS$, state-set array $X$, and $\alpha \in \Sigma$ define

$\mathrm{Move}_{AS}(A, X, \alpha)$:
  1. $X[A] := \mathrm{Move}_A(X[A], \alpha)$
  2. For each child $A_i$ of $A$ in topological order do
     (a) $X := \mathrm{Move}_{AS}(A_i, X, \alpha)$
     (b) If $\phi_{A_i} \in X[A_i]$ then $X[A] := \mathrm{Insert}_A(X[A], \phi_{A_i})$
  3. Return $X$

$\mathrm{Close}_{AS}(A, X)$:
  1. $X[A] := \mathrm{Close}_A(X[A])$
  2. For each child $A_i$ of $A$ in topological order do
     (a) If $\theta_{A_i} \in X[A]$ then $X[A_i] := \mathrm{Insert}_{A_i}(X[A_i], \theta_{A_i})$
     (b) $X := \mathrm{Close}_{AS}(A_i, X)$
     (c) If $\phi_{A_i} \in X[A_i]$ then $X[A] := \mathrm{Insert}_A(X[A], \phi_{A_i})$
     (d) $X[A] := \mathrm{Close}_A(X[A])$
  3. Return $X$

The $\mathrm{Move}_{AS}$ and $\mathrm{Close}_{AS}$ operations recursively traverse the nested decomposition top-down processing
the children in topological order. At each child the shared start and accepting states are propagated in the
state-set array. For simplicity, we have written $\mathrm{Member}_A$ using the symbol $\in$.

The state-set simulation of $N(R)$ on a string $Q$ of length $n$ produces the sequence of state-set arrays
$X_0, \ldots, X_n$ as follows: Let $A_r$ be the root automaton and let $X$ be an empty state-set array (all entries in
$X$ are $\emptyset$). Initially, set $X[A_r] := \mathrm{Insert}_{A_r}(X[A_r], \theta_{A_r})$ and compute $X_0 := \mathrm{Close}_{AS}(A_r, \mathrm{Close}_{AS}(A_r, X))$. For
$i > 0$ we compute $X_i$ from $X_{i-1}$ as follows:

$$X_i := \mathrm{Close}_{AS}(A_r, \mathrm{Close}_{AS}(A_r, \mathrm{Move}_{AS}(A_r, X_{i-1}, Q[i])))$$

Finally, we output $Q \in L(R)$ iff $\phi_{A_r} \in X_n[A_r]$. To see that this algorithm correctly solves Regular
Expression Matching it suffices to show that for any $i$, $0 \leq i \leq n$, $X_i$ correctly models the $i$th state-set
$S_i$ in the standard state-set simulation. We need the following lemma.

Lemma 4 Let $X$ be a state-set array and let $A_r$ be the root TNFA in a nested decomposition $AS$. If $S$ is
the state-set modeled by $X$, then

• $\mathrm{Move}(S, \alpha) \equiv \mathrm{Move}_{AS}(A_r, X, \alpha)$ and

• $\mathrm{Close}(S) \equiv \mathrm{Close}_{AS}(A_r, \mathrm{Close}_{AS}(A_r, X))$.

Proof. First consider the $\mathrm{Move}_{AS}$ operation. Let $\bar{A}$ be the TNFA induced by all states in $A$ and descendants
of $A$ in the nested decomposition, i.e., $\bar{A}$ is obtained by recursively "unfolding" the pseudo-states and
pseudo-transitions in $A$, replacing them by the TNFAs they represent. We show by induction that the
state-array $X_A := \mathrm{Move}_{AS}(A, X, \alpha)$ models $\mathrm{Move}(S, \alpha)$ on $\bar{A}$. In particular, plugging in $A = A_r$, we have
that $\mathrm{Move}_{AS}(A_r, X, \alpha)$ models $\mathrm{Move}(S, \alpha)$ as required.

Initially, line 1 updates $X[A]$ to be the set of states reachable from a single $\alpha$-transition in $A$. If $A$ is
a leaf, line 2 is completely bypassed and the result follows immediately. Otherwise, let $A_1, \ldots, A_l$ be the
children of $A$ in topological order. Any incoming transition to a state $\theta_{A_i}$ or outgoing transition from a
state $\phi_{A_i}$ is an $\epsilon$-transition by Thompson's construction. Hence, no endpoint of an $\alpha$-transition in $A$ can
be shared with any of the children $A_1, \ldots, A_l$. It follows that after line 1 the updated $X[A]$ is the desired
state-set, except for the shared states, which have not been handled yet. By induction, the recursive calls
in line 2(a) handle the children. Among the shared states only the accepting ones, $\phi_{A_1}, \ldots, \phi_{A_l}$, may be the
endpoint of an $\alpha$-transition and therefore line 2(b) computes the correct state-set.

The $\mathrm{Close}_{AS}$ operation proceeds in a similar, though slightly more complicated fashion. Let $\widehat{X}_A$ be the
state-array modeling the set of states reachable via a path of forward $\epsilon$-transitions in $\bar{A}$ (the TNFA obtained
by unfolding $A$ as above), and let $\widetilde{X}_A$ be the state-array modeling $\mathrm{Close}(S)$ in $\bar{A}$. We show by induction that
if $X'_A := \mathrm{Close}_{AS}(A, X)$ then

$$\widehat{X}_A \subseteq X'_A \subseteq \widetilde{X}_A,$$

where the inclusion refers to the underlying state-sets modeled by the state-set arrays. Initially, line 1 updates
$X[A] := \mathrm{Close}_A(X[A])$. If $A$ is a leaf then clearly $X'_A = \widetilde{X}_A$. Otherwise, let $A_1, \ldots, A_l$ be the children of $A$
in topological order. Line 2 recursively updates the children and propagates the start and accepting states in
(a) and (c). Following each recursive call we again update $X[A] := \mathrm{Close}_A(X[A])$ in (d). No state is included
in $X'_A$ if there is no $\epsilon$-path in $A$ or through any child of $A$. Furthermore, since the children are processed in
topological order it is straightforward to verify that the sequence of updates in line 2 ensures that $X'_A$ contains
all states reachable via a path of forward $\epsilon$-transitions in $A$ or through a child of $A$. Hence, by induction we
have $\widehat{X}_A \subseteq X'_A \subseteq \widetilde{X}_A$ as desired.

A similar induction shows that the state-set array $\mathrm{Close}_{AS}(A_r, X'')$ models the set of states reachable from
$X''$ using a path consisting of forward $\epsilon$-transitions and at most 1 back transition. However, by Lemma 1
this is exactly the set of states reachable by a path of $\epsilon$-transitions. Hence, $\mathrm{Close}_{AS}(A_r, X'')$ models $\mathrm{Close}(S)$
and the result follows. □



By Lemma 4 the state-set simulation can be done using the $\mathrm{Close}_{AS}$ and $\mathrm{Move}_{AS}$ operations and the
complexity now directly depends on the complexities of the simulation data structure. Putting it all together
the following reduction easily follows:

Lemma 5 Let $R$ be a regular expression of length $m$ over alphabet $\Sigma$ and let $Q$ be a string of length $n$. Given
a simulation data structure for TNFAs with $x < m$ states over alphabet $\Sigma \cup \{\beta\}$, where $\beta \notin \Sigma$, that supports
all operations in $O(t(x))$ time, using $O(s(x))$ space, and $O(p(x))$ preprocessing time, Regular Expression
Matching for $R$ and $Q$ can be solved in $O(\frac{nm \cdot t(x)}{x} + \frac{m \cdot p(x)}{x})$ time using $O(\frac{m \cdot s(x)}{x})$ space.

Proof. Given $R$ first compute a nested decomposition $AS$ of $N(R)$ using Lemma 3 for parameter $x$. For
each TNFA $A \in AS$ sort $A$'s children topologically and keep pointers to start and accepting states. By
Lemma 3 and since topological sort can be done in $O(m)$ time this step uses $O(m)$ time. The total space
to represent the decomposition is $O(m)$. Each $A \in AS$ is a TNFA over the alphabet $\Sigma \cup \{\beta\}$ with at most
$x$ states and $|AS| = O(\frac{m}{x})$. Hence, constructing simulation data structures for all $A \in AS$ uses $O(\frac{m \cdot p(x)}{x})$
time and $O(\frac{m \cdot s(x)}{x})$ space. With the above algorithm the state-set simulation of $N(R)$ can now be done in
$O(\frac{nm \cdot t(x)}{x})$ time, yielding the desired complexity. □

The idea of decomposing TNFAs is also present in Myers' paper [8], though he does not give a "black-box"
reduction as in Lemma 5. We believe that the framework provided by Lemma 5 helps to simplify the
presentation of the algorithms significantly. We can restate Myers' result in our setting as the existence
of a simulation data structure with $O(1)$ query time that uses $O(x \cdot 2^x)$ space and preprocessing time.
For $x \leq \log(n/\log n)$ this achieves the result mentioned in the introduction. The key idea is to encode
and tabulate the results of all queries (such an approach is frequently referred to as the "Four Russian
Technique" [2]). Bille and Farach-Colton [4] give a more space-efficient encoding that does not use Lemma 5 as
above. Instead they show how to encode all possible simulation data structures in total $O(2^x + m)$ time and
space while maintaining $O(1)$ query time.
In the following sections we show how to efficiently avoid the large tables needed in the previous
approaches. Instead we implement the operations of simulation data structures using the word-level parallelism
of the machine model.

4 A Simple Algorithm 

In this section we present a simple simulation data structure for TNFAs, and develop some of the ideas for
the improved result of the next section. Let $A$ be a TNFA with $m = O(\sqrt{w})$ states. We will show how to
support all operations in $O(1)$ time using $O(m)$ space and $O(m^2)$ preprocessing time.

To build our simulation data structure for $A$, first sort all states in $A$ in topological order ignoring the
back transitions. We require that the endpoints of an $\alpha$-transition are consecutive in this order. This is
automatically guaranteed using a standard $O(m)$ time algorithm for topological sorting (see e.g. [5]). We will
refer to states in $A$ by their rank in this order. A state-set of $A$ is represented using a bitstring $S = s_1 s_2 \ldots s_m$
defined such that $s_i = 1$ iff node $i$ is in the state-set. The simulation data structure consists of the following
bitstrings:

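• For each $\alpha \in \Sigma$, a string $D_\alpha = d_1 \ldots d_m$ such that $d_i = 1$ iff $i$ is an $\alpha$-state.

• A string $E = 0 e_{1,1} e_{1,2} \ldots e_{1,m} 0 e_{2,1} e_{2,2} \ldots e_{2,m} 0 \ldots 0 e_{m,1} e_{m,2} \ldots e_{m,m}$, where $e_{i,j} = 1$ iff $i$ is $\epsilon$-reachable
from $j$. The zeros are test bits needed for the algorithm.

• Three constants $I = (10^m)^m$, $X = 1(0^m 1)^{m-1}$, and $C = 1(0^{m-1} 1)^{m-1}$. Note that $I$ has a 1 in each
test bit position².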

The strings $E$, $I$, $X$, and $C$ are easily computed in $O(m^2)$ time and use $O(m^2)$ bits. Since $m = O(\sqrt{w})$
only $O(1)$ space is needed to store these strings. We store $D_\alpha$ in a hashtable indexed by $\alpha$. Since the total
number of different characters in $A$ can be at most $m$, the hashtable contains at most $m$ entries. Using
perfect hashing $D_\alpha$ can be represented in $O(m)$ space with $O(1)$ worst-case lookup time. The preprocessing
time is expected $O(m)$ w.h.p. To get a worst-case bound we use the deterministic dictionary of Hagerup et
al. [6] with $O(m \log m)$ worst-case preprocessing time. In total the data structure requires $O(m)$ space and
$O(m^2)$ preprocessing time.

Next we show how to support each of the operations on $A$. Suppose $S = s_1 \ldots s_m$ is a bitstring representing
a state-set of $A$ and $\alpha \in \Sigma$. The result of $\mathrm{Move}_A(S, \alpha)$ is given by

$$S' := (S \gg 1) \ \&\ D_\alpha.$$

This should be understood as C notation, where the right-shift is unsigned. Readers familiar with the Shift-Or
algorithm [3] will notice the similarity. To see the correctness, observe that state $i$ is put in $S'$ iff state
$(i - 1)$ is in $S$ and the $i$th state is an $\alpha$-state. Since the endpoints of $\alpha$-transitions are consecutive in the
topological order it follows that $S'$ is correct. Here, state $(i - 1)$ can only influence state $i$, and this makes
the operation easy to implement in parallel. However, this is not the case for $\mathrm{Close}_A$. Here, any state can
potentially affect a large number of states reachable through long $\epsilon$-paths. To deal with this we use the
following steps.

$$
\begin{aligned}
Y &:= (S \times X) \ \&\ E \\
Z &:= ((Y \mid I) - (I \gg m)) \ \&\ I \\
S' &:= ((Z \times C) \ll (w - m(m + 1))) \gg (w - m)
\end{aligned}
$$

We describe in detail why this, at first glance somewhat cryptic sequence, correctly computes $S'$ as the result
of $\mathrm{Close}_A(S)$. The variables $Y$ and $Z$ are simply temporary variables inserted to increase the readability of
the computation. Let $S = s_1 \ldots s_m$. Initially, $S \times X$ concatenates $m$ copies of $S$ with a zero bit between
each copy, that is,

$$S \times X = s_1 \ldots s_m \times 1(0^m 1)^{m-1} = (0 s_1 \ldots s_m)^m.$$

The bitwise & with $E$ gives

$$Y = 0 y_{1,1} y_{1,2} \ldots y_{1,m} 0 y_{2,1} y_{2,2} \ldots y_{2,m} 0 \ldots 0 y_{m,1} y_{m,2} \ldots y_{m,m},$$

where $y_{i,j} = 1$ iff state $j$ is in $S$ and state $i$ is $\epsilon$-reachable from $j$. In other words, the substring $Y_i = y_{i,1} \ldots y_{i,m}$
indicates the set of states in $S$ that have a path of $\epsilon$-transitions to $i$. Hence, state $i$ should be included in
$\mathrm{Close}_A(S)$ precisely if at least one of the bits in $Y_i$ is 1. This is determined next. First $(Y \mid I) - (I \gg m)$ sets
all test bits to 1 and subtracts the test bits shifted right by $m$ positions. This ensures that if all positions in
$Y_i$ are 0, the $i$th test bit in the result is 0 and otherwise 1. The test bits are then extracted with a bitwise &
with $I$, producing the string $Z = z_1 0^m z_2 0^m \cdots z_m 0^m$. This is almost what we want since $z_i = 1$ iff state $i$ is
in $\mathrm{Close}_A(S)$. The final computation compresses $Z$ into the desired format. The multiplication produces
the following length $2m^2$ string:

$$
\begin{aligned}
Z \times C &= z_1 0^m z_2 0^m \ldots z_m 0^m \times 1(0^{m-1} 1)^{m-1} \\
&= z_1 0^{m-1} z_1 z_2 0^{m-2} \cdots z_1 \ldots z_k 0^{m-k} \cdots z_1 \ldots z_{m-1} 0\, z_1 \ldots z_m\, 0 z_2 \ldots z_m \cdots 0^k z_{k+1} \ldots z_m \cdots 0^{m-1} z_m 0^m
\end{aligned}
$$

In particular, positions $m(m - 1) + 1$ through $m^2$ (from the left) contain the test bits compressed into a
string of length $m$. The two shifts zero all other bits and move this substring to the rightmost position in
the word, producing the final result. Since $m = O(\sqrt{w})$ all of the above operations can be done in constant
time.

²We use exponentiation to denote repetition, i.e., $1^3 0 = 1110$.

Finally, observe that $\mathrm{Insert}_A$ and $\mathrm{Member}_A$ are trivially implemented in constant time. Thus,

Lemma 6 For any TNFA with $m = O(\sqrt{w})$ states there is a simulation data structure using $O(m)$ space
and $O(m^2)$ preprocessing time which supports all operations in $O(1)$ time.
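The bit-parallel Move and Close operations of this section can be tried out with the following sketch. Several things here are assumptions made for illustration: the class name `BitParallelTNFA`, the use of Python's arbitrary-precision integers in place of a $w$-bit word, a naive fixpoint loop to build $E$ (not the $O(m^2)$ preprocessing of the lemma), and a shift followed by a mask in place of the final two shifts, which extracts the same $m$ bits. States are numbered $1, \ldots, m$ in the order required by the text and state $i$ is stored in bit $m - i$, so that $s_1$ is the leftmost bit.

```python
class BitParallelTNFA:
    def __init__(self, m, char_edges, eps_edges):
        """char_edges: (i, i + 1, ch) triples; eps_edges: (u, v) pairs."""
        self.m = m
        # D[ch]: bit of state j is set iff j is a ch-state.
        self.D = {}
        for i, j, ch in char_edges:
            assert j == i + 1, "endpoints must be consecutive in the order"
            self.D[ch] = self.D.get(ch, 0) | (1 << (m - j))
        # reach[j]: states epsilon-reachable from j, including j itself.
        reach = {j: {j} for j in range(1, m + 1)}
        changed = True
        while changed:
            changed = False
            for u, v in eps_edges:
                for j in reach:
                    if u in reach[j] and v not in reach[j]:
                        reach[j].add(v)
                        changed = True
        # E = 0 e_{1,1}..e_{1,m} 0 e_{2,1}..e_{2,m} ... with one test bit per block.
        self.E = 0
        for i in range(1, m + 1):
            for j in range(1, m + 1):
                if i in reach[j]:
                    self.E |= 1 << ((m - i) * (m + 1) + (m - j))
        self.I = sum(1 << (k * (m + 1) + m) for k in range(m))  # (1 0^m)^m
        self.X = sum(1 << (k * (m + 1)) for k in range(m))      # 1 (0^m 1)^(m-1)
        self.C = sum(1 << (k * m) for k in range(m))            # 1 (0^(m-1) 1)^(m-1)

    def bits(self, states):
        """Encode a set of state numbers as the bitstring s_1 ... s_m."""
        return sum(1 << (self.m - i) for i in states)

    def move(self, S, ch):
        return (S >> 1) & self.D.get(ch, 0)

    def close(self, S):
        m = self.m
        Y = (S * self.X) & self.E                    # m copies of S, masked by E
        Z = ((Y | self.I) - (self.I >> m)) & self.I  # test bit i survives iff Y_i != 0
        return ((Z * self.C) >> (m * m)) & ((1 << m) - 1)  # gather the test bits

    def matches(self, text):
        S = self.close(self.bits({1}))               # state 1 is the start state
        for ch in text:
            S = self.close(self.move(S, ch))
        return S & 1 != 0                            # state m is the accepting state


# TNFA for a*b with states ordered as required; (4, 3) is the back transition.
nfa = BitParallelTNFA(8, [(3, 4, "a"), (6, 7, "b")],
                      [(1, 2), (2, 3), (2, 5), (4, 5), (4, 3), (5, 6), (7, 8)])
```

Each call to `close` uses a fixed number of arithmetic and bitwise operations on integers of about $2m^2$ bits, which is why the text requires $m = O(\sqrt{w})$ for these to count as constant-time word operations.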

The main bottleneck in the above data structure is the string $E$ that represents all $\epsilon$-paths. On a TNFA
with $m$ states $E$ requires at least $m^2$ bits and hence this approach only works for $m = O(\sqrt{w})$. In the next
section we show how to use the structure of TNFAs to do better.

5 Overcoming the $\epsilon$-closure Bottleneck

In this section we show how to compute an $\epsilon$-closure on a TNFA with $m = O(w)$ states in $O(\log m)$ time.
Compared with the result of the previous section we quadratically increase the size of the TNFA at the
expense of using logarithmic time. The algorithm is easily extended to an efficient simulation data structure.
The key idea is a new hierarchical decomposition of TNFAs described below.

5.1 Partial-TNFAs and Separator Trees 

First we need some definitions. Let $A$ be a TNFA with parse tree $T$. Each node $v$ in $T$ uniquely corresponds
to two states in $A$, namely, the start and accepting states $\theta_{A'}$ and $\phi_{A'}$ of the TNFA $A'$ with the parse tree
consisting of $v$ and all descendants of $v$. We say $v$ associates the states $S(v) = \{\theta_{A'}, \phi_{A'}\}$. In general, if $C$
is a cluster of $T$, i.e., any connected subgraph of $T$, we say $C$ associates the set of states $S(C) = \bigcup_{v \in C} S(v)$.
We define the partial-TNFA (pTNFA) for $C$, as the directed, labeled subgraph of $A$ induced by the set of
states $S(C)$. In particular, $A$ is a pTNFA since it is induced by $S(T)$. The two states associated by the
root node of $C$ are defined to be the start and accepting state of the corresponding pTNFA. We need the
following result.

Lemma 7 For any pTNFA $P$ with $m > 2$ states there exists a partitioning of $P$ into two subgraphs $P_O$ and
$P_I$ such that

(i) $P_O$ and $P_I$ are pTNFAs with at most $\frac{2}{3}m + 2$ states each,

(ii) any transition from $P_O$ to $P_I$ ends in $\theta_{P_I}$ and any transition from $P_I$ to $P_O$ starts in $\phi_{P_I}$, and

(iii) the partitioning can be computed in $O(m)$ time.

Figure 3: (a) Inner and outer pTNFAs. (b) The corresponding separator tree construction.

Proof. Let P be a pTNFA with m > 2 states and let C be the corresponding cluster with t nodes. Since 
C is a binary tree with more than 1 node, Jordan's classical result [7] establishes that we can find in $O(t)$ 
time an edge e in C whose removal splits C into two clusters each with at most $\frac{2}{3}t + 1$ nodes. These two 
clusters correspond to two pTNFAs, $P_O$ and $P_I$, and since m = 2t each of these has at most $\frac{2}{3}m + 2$ 
states. Hence, (i) and (iii) follow. For (ii) assume w.l.o.g. that $P_O$ is the pTNFA containing the start and 
accepting state of P, i.e., $\theta_{P_O} = \theta_P$ and $\phi_{P_O} = \phi_P$. Then, $P_O$ is the pTNFA obtained from P by removing 
all states of $P_I$. From Thompson's construction it is easy to check that any transition from $P_O$ to $P_I$ ends 
in $\theta_{P_I}$ and any transition from $P_I$ to $P_O$ must start in $\phi_{P_I}$. □ 
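
The edge used in the proof can be illustrated with a small sketch. The code below is not Jordan's construction itself: it simply computes subtree sizes and picks the edge whose removal leaves the smallest larger side, which takes linear time in the number of nodes. The tree encoding (a dictionary from a node to its list of children) and the name `balanced_edge` are assumptions made for this illustration only.

```python
def balanced_edge(children, root):
    """Return (v, max_side): cutting the edge above v splits the tree most evenly."""
    # Iterative preorder, so that reversed(order) visits children before parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, ()))
    # Subtree sizes, computed bottom-up.
    size = {}
    for v in reversed(order):
        size[v] = 1 + sum(size[c] for c in children.get(v, ()))
    t = size[root]
    # Cutting the edge above v leaves clusters with size[v] and t - size[v] nodes.
    best = min((v for v in order if v != root),
               key=lambda v: max(size[v], t - size[v]))
    return best, max(size[best], t - size[best])

# A path with 7 nodes and a complete binary tree with 7 nodes.
path = {i: [i + 1] for i in range(6)}
complete = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
path_cut = balanced_edge(path, 0)
complete_cut = balanced_edge(complete, 0)
```

In both examples the larger cluster has 4 of the t = 7 nodes, which is within the $\frac{2}{3}t + 1$ bound used in the proof.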

Intuitively, if we draw P, $P_I$ is "surrounded" by $P_O$, and therefore we will often refer to $P_I$ and $P_O$ as the 
inner pTNFA and the outer pTNFA, respectively (see Fig. 3(a)). Applying Lemma 7 recursively gives the 
following essential data structure. Let P be a pTNFA with m states. The separator tree for P is a binary, 
rooted tree B defined as follows: If m = 2, i.e., P is a trivial pTNFA consisting of two states $\theta_P$ and $\phi_P$, 
then B is a single leaf node v that stores the set $X(v) = \{\theta_P, \phi_P\}$. Otherwise (m > 2), compute $P_O$ and $P_I$ 
according to Lemma 7. The root v of B stores the set $X(v) = \{\theta_{P_I}, \phi_{P_I}\}$, and the children of v are roots of 
separator trees for $P_O$ and $P_I$, respectively (see Fig. 3(b)). 

With the above construction each node in the separator tree naturally corresponds to a pTNFA, e.g., the 
root corresponds to P, the children to $P_I$ and $P_O$, and so on. We denote the pTNFA corresponding to node v 
in B by P(v). A simple induction combined with Lemma 7(i) shows that if v is a node of depth k then P(v) 
contains at most $(\frac{2}{3})^k m + 6$ states. Hence, the depth of B is at most $d = \log_{3/2} m + O(1)$. By Lemma 7(iii) 
each level of B can be computed in $O(m)$ time and thus B can be computed in $O(m \log m)$ total time. 
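
The induction behind the size bound is short. Written out, with $s_k$ denoting the largest number of states of a pTNFA at depth k (a symbol introduced here only for this illustration), it reads roughly as follows:

```latex
\begin{align*}
s_0 &= m, \qquad s_{k+1} \le \tfrac{2}{3}\, s_k + 2
      && \text{(Lemma 7(i))} \\
s_{k+1} - 6 &\le \tfrac{2}{3}\,(s_k - 6)
      && \text{(subtract 6 on both sides)} \\
s_k &\le \left(\tfrac{2}{3}\right)^{k} (m - 6) + 6
      \;\le\; \left(\tfrac{2}{3}\right)^{k} m + 6 .
\end{align*}
```

Once $k \ge \log_{3/2} m$ the first term is at most 1, so a pTNFA at that depth has at most 7 states and only a constant number of further levels can follow.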

5.2 A Recursive ε-Closure Algorithm 

We now present a simple ε-closure algorithm for a pTNFA, which recursively traverses the separator tree B. 
We first give the high level idea and then show how it can be implemented in $O(1)$ time for each level of B. 
Since the depth of B is $O(\log m)$ this leads to the desired result. For a pTNFA P with m states, a separator 
tree B for P, and a node v in B define 

$\mathrm{Close}_{P(v)}(S)$: 1. Compute the set $Z \subseteq X(v)$ of states in X(v) that are ε-reachable from S in P(v). 

2. If v is a leaf return $S' := Z$, else let u and w be the children of v, respectively: 
(a) Compute the set $G \subseteq V(P(v))$ of states in P(v) that are ε-reachable from Z. 
(b) Return $S' := \mathrm{Close}_{P(u)}((S \cup G) \cap V(P(u))) \cup \mathrm{Close}_{P(w)}((S \cup G) \cap V(P(w)))$. 



Lemma 8 For any node v in the separator tree of a pTNFA P, $\mathrm{Close}_{P(v)}(S)$ computes the set of states in 
P(v) reachable from S via a path of ε-transitions. 

Proof. Let $\hat{S}$ be the set of states in P(v) reachable from S via a path of ε-transitions. We need to show that $\hat{S} = S'$. 
It is easy to check that any state in S' is reachable via a path of ε-transitions and hence $S' \subseteq \hat{S}$. We show 
the other direction by induction on the separator tree. If v is a leaf then the set of states in P(v) is exactly 
X(v). Since S' = Z the claim follows. Otherwise, let u and w be the children of v, and assume w.l.o.g. that 
$X(v) = \{\theta_{P(u)}, \phi_{P(u)}\}$. Consider a path p of ε-transitions from a state $s \in S$ to a state s'. There are two cases to 
consider: 

Case 1: $s' \in V(P(u))$. If p consists entirely of states in P(u) then by induction it follows that $s' \in 
\mathrm{Close}_{P(u)}(S \cap V(P(u)))$. Otherwise, p contains a state from P(w). However, by Lemma 7(ii) $\theta_{P(u)}$ is 
on p and hence $\theta_{P(u)} \in Z$. It follows that $s' \in G$ and therefore $s' \in \mathrm{Close}_{P(u)}(G \cap V(P(u)))$. 

Case 2: $s' \in V(P(w))$. As above, with the exception that $\phi_{P(u)}$ is now the state in Z. 

In all cases $s' \in S'$ and the result follows. □ 
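
The recursion can be sketched directly with explicit sets. The code below is only an illustration under several assumptions: reachability is computed by breadth-first search instead of the constant-time bitstring operations developed in the next subsection, and the example automaton is a hand-built TNFA for the expression a*b in which every parse-tree node (concatenation, star, a, b) contributes one start state and one accepting state. The state numbering and the names `Node`, `reach`, and `close` are hypothetical.

```python
from collections import deque

def reach(eps, states, sources):
    """States reachable from `sources` via eps-transitions that stay inside `states`."""
    seen = set(sources) & set(states)
    queue = deque(seen)
    while queue:
        s = queue.popleft()
        for t in eps.get(s, ()):
            if t in states and t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

class Node:
    """Separator-tree node: X is the stored pair, states is V(P(v))."""
    def __init__(self, X, children=()):
        self.X = set(X)
        self.children = children
        # A leaf holds exactly the two states in X; an inner node the union of its children.
        self.states = set(X) if not children else children[0].states | children[1].states

def close(eps, v, S):
    Z = reach(eps, v.states, S) & v.X               # step 1
    if not v.children:                              # step 2, leaf case
        return Z
    G = reach(eps, v.states, Z)                     # step 2(a)
    u, w = v.children
    return (close(eps, u, (S | G) & u.states) |     # step 2(b)
            close(eps, w, (S | G) & w.states))

# States: concatenation (0, 1), star (2, 3), a (4, 5), b (6, 7); only eps-transitions listed.
eps = {0: [2], 2: [4, 3], 5: [3, 4], 3: [6], 7: [1]}
outer = Node({6, 7}, (Node({0, 1}), Node({6, 7})))   # cluster {concatenation, b}
inner = Node({4, 5}, (Node({2, 3}), Node({4, 5})))   # cluster {star, a}
root = Node({2, 3}, (outer, inner))                  # X(root) = states of the star node

all_states = set(range(8))
results = {s: close(eps, root, {s}) for s in all_states}
```

On this example the recursive procedure returns, for every single start state, the same set as a plain search over the whole automaton, e.g. `results[0]` is `{0, 2, 3, 4, 6}`.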



5.3 Implementing the Algorithm 

Next we show how to efficiently implement the above algorithm in parallel. The key ingredient is a compact 
mapping of states into positions in bitstrings. Suppose B is the separator tree of depth d for a pTNFA 
P with m states. The separator mapping M maps the states of P into an interval of integers [1, l], where 
$l = 3 \cdot 2^d$. The mapping is defined recursively according to the separator tree. Let v be the root of B. If v is a 
leaf node the interval is [1, 3]. The two states of P, $\theta_P$ and $\phi_P$, are mapped to positions 2 and 3, respectively, 
while position 1 is left intentionally unmapped. Otherwise, let u and w be the children of v. Recursively, 
map P(u) to the interval [1, l/2] and P(w) to the interval [l/2 + 1, l]. Since the separator tree contains at 
most $2^d$ leaves and each contributes 3 positions the mapping is well-defined. The size of the interval for P is 
$l = 3 \cdot 2^{\log_{3/2} m + O(1)} = O(m)$. We will use the unmapped positions as test bits in our algorithm. 
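
A minimal sketch of the separator mapping follows. It assumes a separator tree encoded as nested pairs, where a leaf is a pair of state names and an inner node is a pair of subtrees; this encoding and the function names are hypothetical. It also assumes that a leaf lying above the maximum depth places its two states at offsets 2 and 3 of its (larger) interval, which is one reading of the recursive definition.

```python
def depth(tree):
    """Depth of a separator tree given as nested pairs; a leaf is a pair of state names."""
    left, right = tree
    if isinstance(left, str):
        return 0
    return 1 + max(depth(left), depth(right))

def separator_mapping(tree):
    """Return (mapping, l): positions of all states in [1, l] with l = 3 * 2^d."""
    length = 3 * 2 ** depth(tree)
    mapping = {}

    def assign(node, lo, hi):
        left, right = node
        if isinstance(left, str):
            # Leaf: position lo stays unmapped (test bit), the states follow it.
            mapping[left], mapping[right] = lo + 1, lo + 2
            return
        mid = lo + (hi - lo + 1) // 2
        assign(left, lo, mid - 1)    # first child gets the lower half
        assign(right, mid, hi)       # second child gets the upper half

    assign(tree, 1, length)
    return mapping, length

# Separator tree for a hand-built TNFA of a*b: outer cluster first, inner cluster second.
tree = ((("th_cat", "ph_cat"), ("th_b", "ph_b")),
        (("th_star", "ph_star"), ("th_a", "ph_a")))
mapping, length = separator_mapping(tree)
unmapped = set(range(1, length + 1)) - set(mapping.values())
```

Here `length` is 12 and positions 1, 4, 7, and 10 remain unmapped; the start and accepting state of each leaf occupy consecutive positions.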

The separator mapping compactly maps all pTNFAs represented in B into small intervals. Specifically, if 
v is a node at depth k in B, then P(v) is mapped to an interval of size $l/2^k$ of the form $[(i - 1) \cdot l/2^k + 1,\ i \cdot l/2^k]$, 
for some $1 \le i \le 2^k$. The intervals that correspond to a pTNFA P(v) are mapped and all other intervals 
are unmapped. We will refer to a state s of P by its mapped position M(s). A state-set of P is represented 
by a bitstring S such that, for all mapped positions i, S[i] = 1 iff the state i is in the state-set. Since $m = O(w)$, 
state-sets are represented in a constant number of words. 

To implement the algorithm we define a simple data structure consisting of four length l bitstrings $X^\theta_k$, 
$X^\phi_k$, $E^\theta_k$, and $E^\phi_k$ for each level k of the separator tree. For notational convenience, we will consider the 
strings at level k as two-dimensional arrays consisting of $2^k$ intervals of length $l/2^k$, i.e., $X^\theta_k[i,j]$ is position j 
in the ith interval of $X^\theta_k$. If the ith interval at level k is unmapped then all positions in this interval are 0 in 
all four strings. Otherwise, suppose that the interval corresponds to a pTNFA P(v) and let $X(v) = \{\theta_v, \phi_v\}$. 
The strings are defined as follows: 

$X^\theta_k[i,j] = 1$ iff $\theta_v$ is ε-reachable in P(v) from state j, 
$E^\theta_k[i,j] = 1$ iff state j is ε-reachable in P(v) from $\theta_v$, 
$X^\phi_k[i,j] = 1$ iff $\phi_v$ is ε-reachable in P(v) from state j, 
$E^\phi_k[i,j] = 1$ iff state j is ε-reachable in P(v) from $\phi_v$. 

In addition to these, we also store a string $I_k$ containing a test bit for each interval, that is, $I_k[i,j] = 1$ iff 
j = 1. Since the depth of B is $O(\log m)$ the strings use $O(\log m)$ words. With a simple depth-first search 
they can all be computed in $O(m \log m)$ time. 

Let S be a bitstring representing a state-set of A. We implement the operation $\mathrm{Close}_A(S)$ by computing 
a sequence of intermediate strings $S_0, \ldots, S_d$ each corresponding to a level in the above recursive algorithm. 
Initially, $S_0 := S$ and the final string $S_d$ is the result of $\mathrm{Close}_A(S)$. At level k, $0 \le k < d$, we compute $S_{k+1}$ 
from $S_k$ as follows. Let $t = l/2^k - 1$. 



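$$
\begin{aligned}
Y^\theta &:= S_k \mathbin{\&} X^\theta_k \\
Z^\theta &:= ((Y^\theta \mid I_k) - (I_k \gg t)) \mathbin{\&} I_k \\
F^\theta &:= Z^\theta - (Z^\theta \gg t) \\
G^\theta &:= F^\theta \mathbin{\&} E^\theta_k \\
Y^\phi &:= S_k \mathbin{\&} X^\phi_k \\
Z^\phi &:= ((Y^\phi \mid I_k) - (I_k \gg t)) \mathbin{\&} I_k \\
F^\phi &:= Z^\phi - (Z^\phi \gg t) \\
G^\phi &:= F^\phi \mathbin{\&} E^\phi_k \\
S_{k+1} &:= S_k \mid G^\theta \mid G^\phi
\end{aligned}
$$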



We argue that the computation correctly simulates (in parallel) a level of the recursive algorithm. Assume 
that at the beginning of level k the string $S_k$ represents the state-set corresponding to the recursive algorithm 
after k levels. We interpret $S_k$ as divided into $r = 2^k$ intervals of length $t = l/2^k - 1$, each prefixed with a 
test bit, i.e., 

$S_k = 0\, s_{1,1} s_{1,2} \cdots s_{1,t}\; 0\, s_{2,1} s_{2,2} \cdots s_{2,t}\; 0 \cdots 0\, s_{r,1} s_{r,2} \cdots s_{r,t}$ 

Assume first that all these intervals are mapped intervals corresponding to pTNFAs $P(v_1), \ldots, P(v_r)$, and 
let $X(v_i) = \{\theta_{v_i}, \phi_{v_i}\}$, $1 \le i \le r$. Initially, $S_k \mathbin{\&} X^\theta_k$ produces the string 

$Y^\theta = 0\, y_{1,1} y_{1,2} \cdots y_{1,t}\; 0\, y_{2,1} y_{2,2} \cdots y_{2,t}\; 0 \cdots 0\, y_{r,1} y_{r,2} \cdots y_{r,t}$, 

where $y_{i,j} = 1$ iff $\theta_{v_i}$ is ε-reachable in $P(v_i)$ from state j and j is in $S_k$. Then, similar to the second line in 
the simple algorithm, $((Y^\theta \mid I_k) - (I_k \gg t)) \mathbin{\&} I_k$ produces a string of test bits $Z^\theta = z_1 0^t z_2 0^t \cdots z_r 0^t$, where 
$z_i = 1$ iff at least one of $y_{i,1}, \ldots, y_{i,t}$ is 1. In other words, $z_i = 1$ iff $\theta_{v_i}$ is ε-reachable in $P(v_i)$ from any state 
in $S_k \cap V(P(v_i))$. Intuitively, $Z^\theta$ corresponds to the "θ-part" of the Z-set in the recursive algorithm. 
Next we "copy" the test bits to get the string $F^\theta = Z^\theta - (Z^\theta \gg t) = 0 z_1^t\, 0 z_2^t \cdots 0 z_r^t$. The bitwise & with 
$E^\theta_k$ gives 

$G^\theta = 0\, g_{1,1} g_{1,2} \cdots g_{1,t}\; 0\, g_{2,1} g_{2,2} \cdots g_{2,t}\; 0 \cdots 0\, g_{r,1} g_{r,2} \cdots g_{r,t}$. 

By definition, $g_{i,j} = 1$ iff state j is ε-reachable in $P(v_i)$ from $\theta_{v_i}$ and $z_i = 1$. In other words, $G^\theta$ represents, 
for $1 \le i \le r$, the states in $P(v_i)$ that are ε-reachable from $S_k \cap V(P(v_i))$ through $\theta_{v_i}$. Again, notice the 
correspondence with the G-set in the recursive algorithm. The next 4 lines are identical to the first 4 with the 
exception that θ is exchanged by φ. Hence, $G^\phi$ represents the states that are ε-reachable through $\phi_{v_1}, \ldots, \phi_{v_r}$. 

Finally, $S_k \mid G^\theta \mid G^\phi$ computes the union of the states in $S_k$, $G^\theta$, and $G^\phi$ producing the desired state-set 
$S_{k+1}$ for the next level of the recursion. In the above, we assumed that all intervals were mapped. If this 
is not the case it is easy to check that the algorithm is still correct since the strings in our data structure 
contain 0s in all unmapped intervals. The algorithm uses constant time for each of the $d = O(\log m)$ levels 
and hence the total time is $O(\log m)$. 
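
The per-level computation can be tried out on a tiny instance. The sketch below uses Python integers as bitstrings, with position 1 taken to be the leftmost (most significant) bit. The level strings are written out by hand for an assumed four-state TNFA of a* (star states at positions 2 and 3, states of a at positions 5 and 6, so l = 6), and the step is applied once per level of that two-level separator tree, leaf level included. The names `bits`, `spread`, `level_step`, and `close` are hypothetical.

```python
def bits(s):
    """Parse a bitstring such as '010 011'; position 1 is the leftmost bit."""
    return int(s.replace(" ", ""), 2)

def level_step(S, X_theta, E_theta, X_phi, E_phi, I, t):
    """One level of the parallel closure: returns S | G_theta | G_phi."""
    def spread(Y, E):
        Z = ((Y | I) - (I >> t)) & I   # a test bit survives iff its interval of Y is nonzero
        F = Z - (Z >> t)               # copy each test bit to the t positions after it
        return F & E
    return S | spread(S & X_theta, E_theta) | spread(S & X_phi, E_phi)

# Level 0: one interval of length 6 (t = 5), X(root) = states of the a-node.
level0 = dict(X_theta=bits("010011"), E_theta=bits("000010"),
              X_phi=bits("000001"), E_phi=bits("001011"),
              I=bits("100000"), t=5)
# Level 1: two leaf intervals of length 3 (t = 2): the star node, then the a-node.
level1 = dict(X_theta=bits("010 011"), E_theta=bits("011 010"),
              X_phi=bits("011 001"), E_phi=bits("001 011"),
              I=bits("100 100"), t=2)

def close(S):
    for level in (level0, level1):
        S = level_step(S, **level)
    return S

from_star_start = close(bits("010000"))   # start state of the star node
from_a_accept = close(bits("000001"))     # accepting state of the a-node
from_a_start = close(bits("000010"))      # start state of the a-node
```

Starting from the start state of the star node, the result contains both star states and the start state of a; starting from the accepting state of a, it contains both states of a and the accepting state of the star node.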

5.4 The Simulation Data Structure 

Next we show how to get a full simulation data structure. First, note that in the separator mapping the 
endpoints of the α-transitions are consecutive (as in Sec. 4). It follows that we can use the same algorithm 
as in the previous section to compute $\mathrm{Move}_A$ in $O(1)$ time. This requires a dictionary of bitstrings, $D_A$, 
using additional $O(m)$ space and $O(m \log m)$ preprocessing time. The $\mathrm{Insert}_A$ and $\mathrm{Member}_A$ operations are 
trivially implemented in $O(1)$. Putting it all together we have: 

Lemma 9 For a TNFA with $m = O(w)$ states there is a simulation data structure using $O(m)$ space and 
$O(m \log m)$ preprocessing time which supports all operations in $O(\log m)$ time. 
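
Because the two endpoints of every character transition sit at consecutive positions, a move on a character can plausibly be done with one mask and one shift. The sketch below is an assumption about how such a step might look, not a reproduction of the algorithm from the previous section: `D` is taken to map each character to a bitstring marking the start states of its transitions, position 1 is the leftmost bit, and the positions are those of a hand-built 8-state TNFA for a*b laid out in an interval of length 12. All names are hypothetical.

```python
L = 12  # assumed interval length l = 3 * 2^d for a separator tree of depth d = 2

def bit(p):
    """Bitstring with a single 1 at position p (position 1 is the leftmost bit)."""
    return 1 << (L - p)

def move(S, D, alpha):
    """Keep the states of S with an outgoing alpha-transition, then step to the next position."""
    return (S & D.get(alpha, 0)) >> 1

# Start states of the a- and b-transitions sit at positions 11 and 5.
D = {"a": bit(11), "b": bit(5)}
S = bit(2) | bit(5) | bit(8) | bit(9) | bit(11)

after_a = move(S, D, "a")
after_b = move(S, D, "b")
after_c = move(S, D, "c")
```

With these inputs `after_a` has only position 12 set, `after_b` only position 6, and `after_c` is empty.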

Combining the simple simulation data structure of the previous section and the one from Lemma 9 with the 
reduction lemma, and taking the best result, gives our main theorem. Note that the simple simulation data 
structure is the fastest when $m = O(\sqrt{w})$ and n is sufficiently large compared to m. 

6 Remarks and Open Problems 

The presented algorithms assume a unit-cost multiplication operation. Since this operation is not in $AC^0$ 
(the class of circuits of polynomial size (in w), constant depth, and unbounded fan-in) it is interesting 
to reconsider what happens with our results if we remove multiplication from our machine model. The 
simulation data structure from Sec. 4 uses multiplication to compute $\mathrm{Close}_A$ and also for the constant time 
hashing to access $D_A$. On the other hand, the algorithm of Sec. 5 only uses multiplication for the hashing. 
However, Lemma 9 still holds since we can simply replace the hashing by a binary search tree, which uses 
$O(\log m)$ time. It follows that our main theorem still holds except for the $O(n + m^2)$ bound in the last line. 

Another interesting point is to compare our results with the classical Shift-Or algorithm by Baeza-Yates 
and Gonnet [3] for exact pattern matching. Like ours, their algorithm simulates an NFA with m states using 
word-level parallelism. The structure of this NFA permits a very efficient simulation with an $O(w)$ speedup 
of the simple $O(nm)$ time simulation. Our results generalize this to regular expressions with a slightly worse 
speedup of $O(w / \log w)$. We wonder if it is possible to remove the $O(\log w)$ factor separating these bounds. 
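
For comparison, the Shift-Or idea fits in a few lines. The sketch below is a textbook-style rendering rather than the code of [3]: bit i of the state word is 0 exactly when the first i + 1 pattern characters match the text ending at the current position, and Python's unbounded integers stand in for a machine word, so the pattern length is not limited to w here. The function name is hypothetical.

```python
def shift_or(pattern, text):
    """Return the start indices of all occurrences of pattern in text."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # mask[c] has a 0 at bit i iff pattern[i] == c.
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, all_ones) & ~(1 << i)
    state = all_ones
    hits = []
    for pos, c in enumerate(text):
        # One shift and one OR advance all m NFA states at once.
        state = ((state << 1) | mask.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits

overlapping = shift_or("aba", "ababa")
single = shift_or("abc", "xxabcx")
```

Each text character costs a constant number of word operations when m is at most w, which is the speedup over the simple simulation referred to above.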

From a practical viewpoint, the simple algorithm of Sec. 4 seems very promising since only about 15 
instructions are needed to carry out a step in the state-set simulation. Combined with ideas from [10] we 
believe that this could lead to a practical improvement over previous algorithms. 